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Abstract. Generalised Mcrsenne Numbers (GMNs) were defined by Solinas 
in 1999 and feature in the NIST (FIPS 186-2) and SECG standards for use 
in elliptic curve cryptography. Their form is such that modular reduction is 
extremely efficient, thus making them an attractive choice for modular mul- 
tiplication implementation. However, the issue of residue multiplication effi- 
ciency seems to have been overlooked. Asymptotically, using a cyclic rather 
than a linear convolution, residue multiplication modulo a Mersenne num- 
ber is twice as fast as integer multiplication; this property does not hold for 
prime GMNs, unless they are of Mersenne's form. In this work we exploit 
an alternative generalisation of Mersenne numbers for which an analogue of 
the above property — and hence the same efficiency ratio — holds, even at 
bitlengths for which schoolbook multiplication is optimal, while also main- 
taining very efficient reduction. Moreover, our proposed primes are abundant 
at any bitlength, whereas GMNs are extremely rare. Our multiplication and 
reduction algorithms can also be easily parallelised, making our arithmetic 
particularly suitable for hardware implementation. Furthermore, the field rep- 
resentation we propose also naturally protects against side-channel attacks, 
including timing attacks, simple power analysis and differential power analy- 
sis, which is essential in many cryptographic scenarios, in constrast to GMNs. 



1. Introduction 

The problem of how to efficiently perform arithmetic in Z/iVZ is a very natural 
one, with numerous applications in computational mathematics and number theory, 
such as primality proving [1], factoring [45], and coding theory [68], for example. It 
is also of central importance to nearly all public-key cryptographic systems, includ- 
ing the Digital Signature Algorithm [24] , RSA [54] , and elliptic curve cryptography 
(ECC) [9]. As such, from both a theoretical and a practical perspective it is inter- 
esting and essential to have efficient algorithms for working in this ring, for either 
arbitrary or special moduli, with the application determining whether generality 
(essential for RSA for instance), or efficiency (desirable for ECC) takes precedence. 

Two intimately related factors need consideration when approaching this prob- 
lem. First, how should one represent residues? And second, how should one perform 
arithmetic on these representatives? A basic answer to the first question is to use 
the canonical representation Z/AZ = {0, ...,N — 1}. With regard to modular 
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multiplication for example, an obvious answer to the second question is to per- 
form integer multiplication of residues, followed by reduction of the result modulo 
N, in order to obtain a canonical representative once again. Using this approach, 
the two components needed for efficient modular arithmetic are clearly fast integer 
arithmetic, and fast modular reduction. 

At bitlengths for which schoolbook multiplication is optimal, research on fast 
modular multiplication has naturally tended to focus on reducing the cost of the 
reduction step. For arbitrary moduli, Montgomery's celebrated algorithm [49] en- 
ables reduction to be performed for approximately the cost of a residue by residue 
multiplication. For the Mersenne numbers Mk = 2 k — 1, efficient modular mul- 
tiplication consists of integer residue multiplication to produce a 2/c-bit product 
U ■ 2 k + L, with U, L of at most fc-bits, followed by a single modular addition 
U + L mod Mk to effect the reduction, as is well known. In 1999 Solinas proposed 
an extension of this method to a larger class of integers: the Generalised Mersenne 
Numbers (GMNs) [60]. As they are a superset, GMNs are more numerous than the 
Mersenne numbers and hence contain more primes, yet incur little additional over- 
head in terms of performance [11]. In 2000, NIST recommended ten fields for use 
in the ECDSA: five binary fields and five prime fields, and due to their performance 
characteristics the latter of these are all GMNs [24], which range from 192 to 521 
bits in size. The Standards for Efficient Cryptography Group also recommended 
the same five prime fields in 2010 [12]. 

For the GMNs recommended by NIST, there is no interplay between the residue 
multiplication and reduction algorithms, each step being treated separately with re- 
spect to optimisation. On the other hand, at asymptotic bitlengths the form of the 
modulus may be effectively exploited to speed up the residue multiplication step. 
For the Mersenne numbers Mk in particular, modular multiplication can be per- 
formed for any k using a cyclic convolution effected by an irrational-base discrete 
weighted transform (IBDWT) [16, §6] (see also [17, §9.5.2-9.5.4] for an excellent 
overview of discrete Fourier transform-based multiplication methods, convolution 
theory and IBDWTs). As such, multiplication modulo Mersenne numbers is ap- 
proximately twice as fast as multiplication of integers of the same bitlength, for 
which a linear convolution is required, as each multiplicand must be padded with 
k zeros before a cyclic convolution of length 2k can be performed. For Mont- 
gomery multiplication at asymptotic bitlengths, the reduction step can be made 
25% cheaper, again by using a cyclic rather than a linear convolution for one of the 
required multiplications [53]. However, since the multiplication step is oblivious to 
the form of the modulus, it seems unlikely to possess the same efficiency benefits 
that the Mersenne numbers enjoy. These considerations raise the natural question 
of whether there exists a similar residue multiplication speed-up at bitlengths for 
which schoolbook multiplication is optimal? Certainly for the modulus N = 2 k , 
such a speed-up can be achieved, since the upper half words of the product can 
simply be ignored. However, this modulus is unfortunately not at all useful for 
ECC. 

In this work we answer the above question affirmatively, using an alternative 
generalisation of Mersenne numbers, which has several desirable features: 

— Simple. Our proposed family is arguably a far more natural generalisation of 
Mersenne numbers than Solinas', and gives rise to beautiful multiplication and 
reduction algorithms. 
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— Abundant. Our primes are significantly more numerous than the set of prime 
GMNs and are abundant for all tested bitlcngths; indeed their number can be 
estimated using Bateman and Horn's quantitative version [3] of Schinzel and 
Sicrpihski's "Hypothesis H" [56]. 

— Fast multiplication. Our residue multiplication is nearly twice as fast as mul- 
tiplication of integer residues. 

— Fast reduction. Our reduction has linear complexity and is particularly effi- 
cient for specialised parameters, although such specialisation comes at the cost 
of reducing the number of primes available. 

— Parallelisable. Both multiplication and reduction can be easily parallelised, 
making our arithmetic particularly suitable for hardware implementation. 

— Side-channel secure. Our representation naturally protects against well-known 
side-channel attacks on ECC (see [10, ch. IV] for an overview), in contrast to 
the NIST GMNs, see [55] and [58, §3.2]. This includes timing attacks [40, 64], 
simple power analysis [55] and differential power analysis [41]. 

This article provides an introductory (and comprehensive) theoretical framework 
for the use of our proposed moduli. It thus serves as a foundation for a new ap- 
proach to the secure and efficient implementation of prime fields for ECC, both in 
software and in hardware. At a high level, our proposal relies on the combination of 
a remarkable algebraic identity used by Nogami, Saito, and Morikawa in the context 
of extension fields [51], together with the residue representation and optimisation 
of the reduction method proposed by Chung and Hasan [15], which models suitable 
prime fields as the quotient of an integer lattice by a particular equivalence relation. 
To verify the validity of our approach, we also provide a proof-of-concept implemen- 
tation that is already competitive with the current fastest modular multiplication 
algorithms at contemporary ECC security levels [5, 26, 32, 46, 25, 6]. 

The sequel is organised as follows. In §2 we present some definitions and recall 
related work. In §3 we describe the basis of our arithmetic, then in §4-6 we present 
details of our residue multiplication, reduction and representation respectively In 
§7 we show how to ensure I/O stability for modular multiplication, then in §8 we put 
everything together into a full modular multiplication algorithm. We then address 
other arithmetic operations and give a brief treatment of side-channel secure ECC 
in §9, and in §10 show how to generate suitable parameters. In §11 we present our 
implementation results and finally, in §12 we draw some conclusions. 

2. Definitions and Related Work 

In this section we introduce the cyclotomic primes and provide a summary of 
related work. We begin with the following definition. 

Definition 2.1. For n > 1 let Cn be a primitive n-th root of unity. The n-th 
cyclotomic polynomial is defined by 

$„(x)= n (z-c)=Y[^-x n,d Y {d \ 

(k,n) — l d\n 

where [i is the Mobius function. 

Two basic properties of the cyclotomic polynomials are that they have integer 
coefficients, and are irreducible over Z. These two properties ensure that the eval- 
uation of a cyclotomic polynomial at an integer argument will also be an integer, 



4 



ROBERT GRANGER AND ANDREW MOSS 



and that this integer will not inherit a factorisation from one in Z[.t]. One can 
therefore ask whether or not these polynomials ever assume prime values at integer 
arguments, which leads to our next definition. 

Definition 2.2. For n > 1 and t € Z, if p = is prime, we call p an n-th 

cyclotomic prime, or simply a cyclotomic prime. 

Note that for all primes p, we have p = 1) = ^{p — 1), and so trivially all 

primes are cyclotomic primes. These instances are also trivial in the context of the 
algorithms we present for performing arithmetic modulo these primes, since in both 
cases the cyclotomic polynomials are linear and our algorithms reduce to ordinary 
Montgomery arithmetic. Hence for the remainder of the article we assume n > 3. 

In addition to being prime-evaluations of cyclotomic polynomials, note that for a 
cyclotomic prime p = the field F p can be modelled as the quotient of the ring 

of integers of the n-th cyclotomic field Q(Cn), by the prime ideal n = (p,(n—t)- This 
is precisely how one would represent ¥ p when applying the Special Number Field 
Sieve to solve discrete logarithms in F p , for example [44]. Hence our nomenclature 
for these primes seems apt. This interpretation of ¥ p for p a cyclotomic prime is 
implicit within the arithmetic we develop here, albeit only insofar as it provides 
a theoretical context for it; this perspective offers no obvious insight into how to 
perform arithmetic efficiently and the algorithms we develop make no use of it 
at all. Similarly, the method of Chung and Hasan [15] upon which our residue 
representation is based can be seen as arising in exactly the same way for the much 
larger set of primes they consider, with the field modelled as a quotient of the ring 
of integers of a suitable number field by a degree one prime ideal, just as for the 
cyclotomic primes. 

2.1 . Low redundancy Cyclotomic Primes. The goal of the present work is to 
provide efficient algorithms for performing ¥ p arithmetic, for p = a cyclotomic 

prime. As will become clear from our exposition, in order to exploit the available 
cyclic structure — for both multiplication and reduction — we do not use the field 
Z/$„(i)Z, but instead embed into the slightly larger ring Z/(i n — 1)Z if n is odd, and 
Z/(i™/ 2 + 1)Z if n is even. In each case, using the larger ring potentially introduces 
an expansion factor e(n) into the residue representation. One can alternatively 
view this in terms of a redundancy measure r(n), where r = e — 1. Since using a 
larger ring for arithmetic will potentially be slower, we now identify three families 
of cyclotomic polynomials for which the above embeddings have low redundancy. 

For n even, there is a family of cases for which the above embedding does not 
introduce any redundancy, namely for n = 2 fe , since <& 2 k (t) = t 2 + 1 = t 2 / 2 + 1, 
and hence e = 1 and r = 0. When t — 2 these are of course the Fermat numbers, 
and for general t these integers are known as Generalised Fermat Numbers (GFNs). 
It is expected that for each k there arc infinitely many t for which t 2 +1 is prime [20, 
§3]- 

If n = 2pforpprime, then $ 2p (i) = i p_1 -i p_2 H hi-1 = (t'P + 1) / (t+1) and 

in this case e = p/(p — 1) and r = l/(p — 1). The primality of these numbers was 
studied in [21] , and while they apparently do not have a designation in the literature, 
one can see that by substituting t with —t in the third family below produces this 
one. For general even n we have e = n/2<p(n) and r = (n — 2(j){n))/2(f)(n), with <f>(-) 
Euler's totient function, which is the degree of Hence amongst those even n 

which are not a power of 2, this family produces the successive local minima of r. 
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For odd n, we have e = n/<p(n) and r = (n — 4>{n)) / '<p(n) . The successive local 
minima of r occur at n — p for p prime, in which case 3> P (i) = i p_1 + t p ~ 2 + • • • + 
t + 1 = (t p - l)/(t- 1), also with r = l/(p— 1). When t = 2 these are of course the 
Mersenne numbers, and in analogy with the case of Fermat numbers, it would be 
natural to refer to these integers for general t as Generalised Mersenne Numbers, 
particularly as one can show they share the aforementioned asymptotic efficiency 
properties of the Mersenne numbers, while Solinas' GMNs do not, unless they are 
of Mersenne's form. However, this family of numbers is known in the literature as 
generalised repunits [65, 59, 19], since their b&se-t expansion consists entirely of l's. 
Therefore for the sake of uniform nomenclature, we use the following definition. 

Definition 2.3. For to + 1 an odd prime let 

p = $ m+ i(i) = t m + t m - x + ■ ■ ■ + t + 1. 

We call such an integer a Generalised Repunit (GR); when p is prime we call it a 
Generalised Repunit Prime ( GRP). 

We have developed modular multiplication algorithms for both GRPs and GFNs. 
In terms of efficiency, for GRPs and GFNs of the same bitlcngth the respective 
multiplication algorithms require exactly the same number of word-by-word multi- 
plications. Also, our reduction algorithms for both GRPs and GFNs are virtually 
identical. However, the multiplication algorithm for GFNs is far less elegant, is 
not perfectly parallelisable and contains more additions. Furthermore, for a given 
bitlcngth there are fewer efficient GFN primes than there are GRPs — as the 
bitlcngth of GFNs doubles as k is incremented — and the I/O stability analysis 
for multiplication modulo a GRP is far simpler. Therefore in this exposition we 
focus on algorithms for performing arithmetic modulo GRPs and their analysis 
only. Note that the studies of GRPs [19, 65] consider only very small t and large 
to, whereas we will be interested in t approximately the word base of the target 
architecture, and m the number of words in the prime whose field arithmetic we 
arc to implement. Hence one expects (and finds) there to be very many GRPs for 
any given relevant bitlength, see §10. 

2.2. Related work. In the context of extension fields, let m + 1 be prime and let 
p be a primitive root modulo m + 1. Then F p ™ = F p [x]/($ m +i(x)F p [x]). In the 
binary case, i.e., p = 2, several authors have proposed the use of this polynomial 
- also known as the all-one polynomial (AOP) — to obtain efficient multiplica- 
tion algorithms [34, 66, 8, 57]. All of these rely on the observation that the field 
F 2 [x]/($ m+ i(x)F 2 [x]) embeds into the ring F 2 [x]/((x m+1 + l)F 2 [x]) — referred to 
by Silverman [57] as the "ghost bit" basis — which possesses a particularly nice 
cyclic structure, but introduces some redundancy. Similarly, this idea applies to any 
cyclotomic polynomial, and several authors have investigated this strategy, embed- 
ding suitably defined extension fields into the ring F 2 [x]/((x™ + l)F 2 [x]) [22, 27, 67]. 

For odd characteristic extension fields, Silverman noted that the "ghost bit" 
basis for p = 2 extends easily to larger p [57], while Kwon et al. have explored 
this idea further [42] . Central to our application is the work of Nogami, Saito and 
Morikawa [51], who used the AOP to obtain a very fast multiplication algorithm, 
see §4. The use of cyclotomic polynomials in extension field arithmetic is therefore 
well studied. In the context of prime fields however, the present work appears to be 
the first to transfer ideas for cyclotomic polynomials from the domain of extension 
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field arithmetic to prime field arithmetic, at least for the relatively small bitlengths 
for which schoolbook multiplication is optimal. 

With regard to the embedding of a prime held into a larger integer ring, the idea 
of operand scaling was introduced by Walter in order to obtain a desired represen- 
tation in the higher-order bits [61], which aids in the estimation of the quotient 
when using Barrett reduction [2]. Similarly, Ozturk et al. proposed using helds 
with characteristics dividing integers of the form 2 k ± 1, with particular applica- 
tion to ECC [52]. As stated in the introduction, there are numerous very efficient 
prime field ECC implementations [5, 26, 32, 46, 6]. While the moduli used in these 
instances permit fast reduction algorithms, and the implementations are highly op- 
timised, it would appear that none of them permit the same residue multiplication 
speed-up that we present here, which is one of the central distinguishing features 
of the present work. 

3. GRP Field Representation 

In this section we present a sequence of representations of F p , with p a GRP, 
the final one being the target representation which we use for our arithmetic. We 
recall the mathematical framework of Chung-Hasan arithmetic, in both the general 
setting and as specialised to GRPs, focusing here on the underlying theory, deferring 
explicit algorithms for residue multiplication, reduction and representation until §4- 
6. 

3.1. Chung-Hasan arithmetic. We now describe the ideas behind Chung-Hasan 
arithmetic [13, 14, 15]. The arithmetic was developed for a class of integers they 
term low- weight polynomial form integers (LWPFIs), whose definition we now re- 
call. 

Definition 3.1. An integer p is a low- weight polynomial form integer (LWPFI), if 

it can be represented by a monic polynomial /(<) = t n + f n -it n ^ 1 H h fit + fo, 

where t is a positive integer and < £ for some small positive integer £ < t. 

Note that if for a given LWPFI each fo e {±1, 0} and t = 2 fc , then it is a GMN, as 
defined by Solinas [60]. The key idea of Chung and Hasan is to perform arithmetic 
modulo p using representatives from the polynomial ring Z[T]/(/(T)Z[T]). To 
do so, one uses the natural embedding ijj : F p Z[T]/(/(T)Z[T]) obtained by 
taking the base t expansion of an element of F p in the canonical representation 
F p = {0, ... ,p — 1}, and substituting T for t. To compute ip~ 1 one simply makes 
the inverse substitution and evaluates the expression modulo p. 

The reason for using this ring is straightforward: since ip~ l is a homomor- 
phism, when one computes z(T) = x(T) ■ y(T) in Z[T], reducing the result modulo 
f{T) to give w(T) does not change the element of F p represented by z(T), i.e., if 
z(T) = w(T) (mod f(T)), then z(t) = w(t) (mod p), since p = f(t). Furthermore, 
since f(T) has very small coefficients, w(T) can be computed from z(T) using only 
additions and subtractions. Hence given the degree 2(n — 1) product of two degree 
n — 1 polynomials in Z[T], its degree n — 1 representation in Z[T]/(/(T)Z[T]) can 
be computed very efficiently. Note that for non-low-weight polynomials this would 
no longer be the case. 

The only problem with this approach is that when computing z(T) as above, 
the coefficients of z(T), and hence w(T), will be approximately twice the size of 
the inputs' coefficients, and if further operations are performed the representatives 
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will continue to expand. Since for I/O stability one requires that the coefficients 
be approximately the size of t after each modular multiplication or squaring, one 
must somehow reduce the coefficients of w{T) to obtain a standard, or reduced 
representative, while ensuring that tp~ 1 (w(T)) remains unchanged. 

Chung and Hasan refer to this issue as the coefficient reduction problem (CRP), 
and developed three solutions in their series of papers on LWPFI arithmetic [13, 
14, 15]. Each of these solutions is based on an underlying lattice, although this was 
only made explicit in [15]. Since the lattice interpretation is the most elegant and 
simplifies the exposition, in the sequel we opt to develop the necessary theory for 
GRP arithmetic in this setting. 

3.2. Chung-Hasan representation for GRPs. Let p = <E> m+ i(i) be a GRP. 
Our goal is to develop arithmetic for ¥ p , and we begin with the canonical rep- 
resentation F p = Z/<f> m+ i(t)Z. As stated in §2.1, the first map in our chain of 
representations takes the canonical ring and embeds it into Z/(i m+1 — 1)Z, for 
which the identity map suffices. To map back, one reduces a representative modulo 
p. We then apply the Chung-Hasan transformation of §3.1, which embeds the sec- 
ond ring into Z[T]/(T m+1 — 1)Z[T], by taking the base t expansion of a canonical 
residue representative in Z/ (t m+1 — 1)Z, and substituting T for t. Wc call this map 
tp. To compute ip -1 one simply makes the inverse substitution and evaluates the 
expression modulo t m+1 — 1. 

Note that the codomain of ip may be regarded as an (m + l)-dimensional vector 
space over Z, equipped with the natural basis {T m , . . . ,T, 1}. In particular, for 
x(T) e Z[T]/{T m+1 - 1)Z[T], where 

x(T) = x m T m + ... + Xl T + x , 

one can consider x(T) to be a vector x = [x m , . . . ,x ] e Z m+1 . Since Z m+1 has 
elements whose components are naturally unbounded, for each x € Z/(i m+1 — 1)Z 
there are infinitely many elements of Z m+1 that map via i/> _1 to x. Therefore in 
order to obtain a useful isomorphism directly between Z/(£ m+1 — 1)Z and Z m+1 , 
we identify two elements of Z m+1 whenever they map via ip -1 to the same element 
of Z/(t m+1 - 1)Z, i.e., 

(3.1) x~y <=s. ^(x) ^^(y) (mod t m+1 - 1), 

and take the image of ip to be the quotient of Z m+1 by this equivalence relation. 
Pictorially, we thus have: 

F p C Z/(t m+1 - 1)Z = z m+1 / ~ 

As mentioned in §3.1, for each coset in Z m+1 / ~, we should like to use a minimal, 
or in some sense 'small' representative, in order to facilitate efficient arithmetic after 
a multiplication or a squaring, for example. Since we know that the base-i expansion 
of every x € Z/ (t m+1 — 1)Z gives one such representative for each coset in Z m+1 / ~, 
for a reduction algorithm we just need to be able to find it, or at least one whose 
components are of approximately the same size. Chung and Hasan related finding 
such 'nice' or reduced coset representatives to solving a computational problem in 
an underlying lattice, which we now recall. 

3.3. Lattice interpretation. Given an input vector z, which is the output of a 
multiplication or a squaring, a coefficient reduction algorithm should output a vec- 
tor w such that w ~ z, in the sense of (3.1), whose components are approximately 



8 



ROBERT GRANGER AND ANDREW MOSS 



the same size as t. As observed in [15], the equivalence relation (3.1) is captured 
by an underlying lattice, and finding w is tantamount to solving an instance of the 
closest vector problem (CVP) in this lattice. To see why this is, we first fix some 
notation as in [15]. 

Let u and v be vectors in Z m+1 such that the following condition is satisfied: 



Then we say that u is congruent to v modulo t m+1 — l and write this as u = tm+ i_ 1 v. 
Note that this is exactly the same as saying ^> _1 (u) = ?/> _1 (v) (mod t m+1 — 1), and 

SOU~V U = tm +l_! v. 

Similarly, but abusing notation slightly, for any integer b ^ t m+1 — 1 (where b 
is typically a power of the word base of the target architecture), we write u =j v 
for some integer v satisfying [t m , . . . , t, 1] • u T = v (mod b), and say u is congruent 
to v modulo b, in this case. We reserve the use of '=' to express a component- 
wise congruence relation, i.e., u = v (mod b). Finally, we denote by u mod b the 
component-wise modular reduction of u by b. 

The lattice underlying the equivalence relation (3.1) can now enter the frame. 
Let V = {v , . . . , v m } be a set of m+ 1 linearly independent vectors in Z m+1 such 
that Vj = t m+i_i 0, the all zero vector, for i = 0, . . . , m. Then the set of all integer 
combinations of elements of V forms an integral lattice, £(V), with the property 
that for all z G Z m+1 , and all u G C, we have 

(3.2) z + u= t m+i_!Z 

In particular, the equivalence relation (3.1) is captured by the lattice £, in the sense 
that 



Therefore if one selects basis vectors for L that have infinity-norm approximately 
t, then for a given z G Z m+1 , finding the closest vector u G £ to z (with respect 
to the Loo-norm), means the vector w = z — u is in the fundamental domain of C, 
and so has components of the desired size. Furthermore, since w = z u, by (3.2) 
we have 



and hence solving the CVP in this lattice solves the CRP. In general solving the 
CVP is NP-hard, but since we can exhibit a good (near-othogonal) lattice basis for 
LWPFIs, and an excellent lattice basis for GRPs, solving it is straightforward in 
our case. 

3.4. Lattice basis and simple reduction. For GRPs, we use the following basis 



[r,...,t,i]-u r = [t 1 



.-m 



...,M]-v T (modt m+1 -l) 




W = t m+l_l Z, 



for C: 



1 

-t 





1 

-t 















-t 






(3.3) 













-t 





1 

-t 





1 



Observe that the infinity-norm of each basis vector is t, so elements in the funda- 
mental domain will have components of the desired size, and that each basis vector 
is orthogonal to all others except the two adjacent vectors (considered cyclically). 
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In order to perform a simple reduction that reduces the size of components by ap- 
proximately log 2 t bits, write each component of z in base t: zi = z^\t + Zi$. If we 



Zjn 
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Z\ 
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•• 


• 





-t 


-t 


1 •• 


• 











-t ■ ■ 


• 











•• 


• -t 


1 








•• 
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-t 


1 



Zm-1,1 
Zm-2,1 



Z0,1 
Zm,\ 



then w = t m+i_i z and each \un\ « \zi\/t, assuming \zi\ > t 2 . This was the method 
of reduction described in [13], which requires integer division. The idea described 
in [14] was based on an analogue of Barrett reduction [2]. The method we shall 
use, from [15], is based on Montgomery reduction [49] and for t not a power of 2 is 
the most efficient of the three Chung-Hasan methods. 

3.5. Montgomery lattice-basis reduction. In ordinary Montgomery reduction [49], 
one has an integer < Z < pR which is to be reduced modulo p, an odd prime, 
where here R is the smallest power of the word base b larger than p. The central 
idea is to add a multiple of p to Z such that the result is divisible by R. Upon 
dividing by R, which is a simple right shift of words, the result is congruent to 
ZR~ X (mod p), and importantly is less than 2p. 

In the context of GRPs, let R = b q be the smallest power of b greater than 
t. The input to the reduction algorithm is a vector z G Z m+1 for which each 
component is approximately R 2 . The natural analogue of Montgomery reduction 
is to add to z a vector u g C whose components are also bounded by R 2 , such that 
z + u = [0, . . . , 0] (mod R). Then upon the division of each component by R, the 
result will be a vector w which satisfies 

W = t m+l_l (Z + TI) • R^ 1 = t m+l_! Z • i2 _1 , 

and which has components of the desired size. While this introduces an term 
into the congruence, as with Montgomery arithmetic, one circumvents this simply 
by altering the original coset representation of Z/(i m+1 — 1)Z, via the map x i-> xR 
(mod t m+1 — 1), which is bijective since gcd(t m+1 — 1, R) = 1, assuming t is even, 
see §5. How then does one find a suitable lattice point u? For this one use the 
lattice basis (3.3), which from here on in we call L. Proposition 3 of [15] proves 
that det L = 1 — t m+1 , and so gcd(det L, R) = 1. One can therefore compute 

dcf 



(3.4) u 1 = —L~ ■ z (modi?), 

(3.5) w T (z T + L-u T )/R, 

giving w with the required properties. Observe that the form of these two opera- 
tions is identical to Montgomery reduction, the only difference being that integer 
multiplication is replaced by matrix by vector multiplication. It is easy to see that 
this is what one requires, since for any u 6 Z m+1 , we have L ■ u T € C, and so 

z T + L ■ u T = tm +i_i z T . 

Furthermore, modulo R we have 

z T + L ■ u T = z T + L ■ (-L- 1 ■ z T mod R) = [0, . . . , 0] T , 
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ensuring the division of each component by R is exact. Hence w = tm +i_i z • R , 
as claimed. 

In [15], an algorithm was given for computing u and w in (3.4) and (3.5) re- 
spectively, for an arbitrary LWPFI f(t). The number of word-by- word multiply 
instructions in the algorithm — which is the dominant cost — is ~ nq 2 , where n 
is the degree of f(t), and R = b q . In comparison, for ordinary Montgomery reduc- 
tion modulo an integer of equivalent size this number is n 2 q 2 , making the former 
approach potentially very attractive. For our choice of primes — the GRPs — our 
specialisation of this algorithm is extremely efficient, as we show in §5. 

3.6. High level view of Chung Hasan-arithmetic. For extension fields, there 
exists a natural separation between the polynomial arithmetic of the extension, and 
the prime subfield arithmetic, which makes respective optimisation considerations 
for each almost orthogonal. On the other hand, if for an LWPFI one naively 
attempts to use efficient techniques that are valid for extension fields, then one 
encounters an inherent obstruction, namely that there is no such separation between 
the polynomial arithmetic and the coefficient arithmetic, which leads to coefficient 
expansion upon performing arithmetic operations. Chung-Hasan arithmetic can be 
viewed as a tool to overcome this obstruction, since it provides an efficient solution 
to the coefficcnt reduction problem. In practice therefore any efficient techniques 
for extension field arithmetic can be ported to prime fields, whenever the prime is 
an LWPFI, which is precisely what we do in §4. 



In this section we detail algorithms for performing multiplication of GRP residue 
representatives. While for the reduction and residue representation we consider 
elements to be in Z m+1 , the multiplication algorithm arises from the arithmetic of 
the polynomial ring Z[T]/(T m+1 — 1)Z[T], and so here we use this ring to derive 
the multiplication formulae. 

4.1. Ordinary multiplication formulae. Let K = Z[T]/(T m+1 - 1)Z[T], and 
let x = [x m , . . . , xo] and y = \y m , . . . , y ] be elements in 1Z. Then in 1Z the product 
x • y is equal to \z m , • • • , zq], where 



where the subscript (i) denotes i (mod m+1). This follows from the trivial property 



T rn+1 = 1 ( mod T m+1 _ ^ and that for ^ = £>m ^ ^rpi &nd - = y . T 3^ we 



4. GRP Multiplication 



rn 



(4.1) 




have: 



m mm 



xy = 




mm mm 



i=0 j=0 ' j=0 i=0 



This is of course just the cyclic convolution of x and y. 
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4.2. Multiplication formulae of Nogami et al. Nogami, Saito and Morikawa 
proposed the use of all-one polynomials (AOPs) to define extensions of prime 
fields [51]. In this section we will first describe their algorithm in this context, 
and then show how it fits into the framework developed in §3. 

Let F p be a prime field and let /(w) = w m + cj m_1 + • ■ • + u> + 1 be irre- 
ducible over F p , i.e., to + I is prime and p is a primitive root modulo to + 1. Then 
F pm = ¥ p [oj]/(f(oj)¥ p [oj]). Using the polynomial basis {w m ,w m ~ 1 , . . . ,u} — rather 
than the more conventional {w m_1 , . . . , u, 1} — elements of F p ™ are represented as 
vectors of length to over F p : 

x = [x m , . . . , xi] = x m uj m + Xm-lLj" 1 ^ 1 H h XlCJ. 

Let x = [x m , . . . , x\] and y = [y m , . . . , j/i] be two elements to be multiplied. For 
< i < to, let 

m/2 

(4.2) qi = £> <4+j) - x a-j)^y{i+ 3 ) - w<*-i>)> 

where the subscript {i) here, as in §4.1, denotes i (mod to + 1). One then has: 

m 

(4.3) z = x y = '^z i u\ with z { = q - qi. 

i=i 

Nogami et al. refer to these coefficient formulae as the cyclic vector multiplication 
algorithm (CVMA) formulae. The CVMA formulae are remarkable, since the num- 
ber of Fp multiplications is reduced relative to the schoolbook method from m 2 to 
to(to + I )/2, but at the cost of increasing the number of F p additions from m 2 — 1 
to 3to(to — l)/2 — I. As alluded to in §3.6, a basic insight of the present work is 
the observation that one may apply the expressions in (4.2) to GRP multiplication, 
provided that one uses the Chung-Hasan representation and reduction methodology 
of §3, to give a full modular multiplication algorithm. 

Note that Karatsuba-Ofman multiplication [36] offers a similar trade-off for ex- 
tension field arithmetic. Crucially however, as we show in §4.6, when we apply 
these formulae to GRPs the number of additions required is in fact reduced. One 
thus expects the CVMA to be significantly more efficient at contemporary ECC 
bitlengths. The original proof of (4.3) given in [51] excludes some intermediate 
steps and so for the sake of clarity we give a full proof in §4.4, beginning with the 
following motivation. 

4.3. Alternative bases. Observe that in the set of equations (4.2), each of the 
2(to + I) coefficients xj, yj is featured m + 1 times, and so there is a nice symmetry 
and balance to the formulae. However due to the choice of basis, both xq and yo 
are implicitly assumed to be zero. The output z naturally has this property also, 
and indeed if one extends the multiplication algorithm to compute zq we see that 
it equals <7o — ?o = 0. 

At first sight, the expression Zi = qo — qi may seem a little unnatural. It is easy 
to change the basis from {u) m , . . . , cj} to {o;™^ 1 , . . . , w, 1}: for x = [x m -i, ■ ■ ■ ,x n ] 
and y = [y m -i, . . . , yo], we have: 

m— 1 

z = x y = ^2 z * w *> 

i=0 
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resulting in the expressions z% = q m — Qi, with q\ as given before. This change of 
basis relies on the relation 

(4.4) oj m = -l-oj w" 1 - 1 mod /(w). 

Note that in using this basis we have implicitly ensured that x m = y m = in (4.2), 
rather than xo = ya = 0, and again the above formula is consistent since z m — 
q m — q m = 0. More generally if one excludes ur from the basis, then xt = yk = 
and Zi = q k - qi- 

One may infer from these observations that the most natural choice of basis 
would seem to be {oj m , . . . ,u>, 1}, and that the expressions for qi arise from the 
arithmetic in the quotient ring TV = ¥ p [oj]/((uj m+1 — 1)¥ p [uj]), rather than F pm = 
¥ p [u]/(f(uj)¥p[uj]). In this case multiplication becomes 

m—1 m — 1 m 

z = x • y = ^2 z i w% = X ( qm ~ q ^ w ' = X! ~*^\ 
i=a i=o i=o 

where for the last equality we have again used equation (4.4). 



4.4. Derivation of coefficient formulae. We now derive the CVMA formulae 
of (4.2). Let x = [x m ,---,xo] = E™ a;iW J , and y = [y m ,...,y ] = Y^oVi^ 1 - 
Then in the ring TV ', as in (4.1) the product x • y is equal to Y^T=o z i U)% i where 



Of crucial importance is the following identity. For < i < m we have: 

m mm 

(4.5) 2^i y) i/ (H) -2^x (o) y (o) = -^( x 0> ~ x {i-j))(V(i) -»<i-j>)- 

j=0 j=0 j=0 

To verify this identity observe that when one expands the terms in the right-hand 
side, the two negative sums cancel with the second term on the left-hand side, since 
both are over a complete set of residues modulo m + 1. Similarly the two positive 
sums arc equal and therefore cancel with the convolutions in the first term on the 
left-hand side. We now observe that there is some redundancy in the right-hand 
side of (4.5), in the following sense. First, observe that 

mm m 
3=0 j=0 3=0 

One can therefore rewrite the right-hand side of (4.5) as: 

m 

(4.6) - £><4+j> - x a-3))(y(i+i) - va-3))- 

3=0 

Noting that the j = term of expression (4.6) is zero, we rewrite it as: 

m/2 m 

~Y.( x (i+3)- x a-3))ty(i+3)-y^-3))- E ( x a+3)- x {i-3))(y(i+3)-y{i-3))' 

J=l j=m/2+l 
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which in turn becomes 

m/2 m/2 

3=1 i=i 
and then upon negating the two terms in the second summation, we finally have 

m m/2 
j=0 J=l 

Hence (4.5) becomes 

m m m/2 

(4.7) ^^(i)!/(H) =E x O')f(i) - E( x (l+i) -f(l-i))- 

3=0 j=0 j = l 

Equation (4.7) gives an expression for the coefficients of the product z of elements 
x and y, in the ring TV . Assuming these are computed using the more efficient 
right-hand side, in order to restrict back to ¥ p [oj]/ (f(uj)¥p[uj}), one can reduce the 
resulting polynomial z by /(w). Note however that one does not need to use a 
smaller basis a la Nogami et al. in §4.2 or §4.3 , but can reduce by f(ui) implicitly, 
without performing any computation. Indeed, letting (x,y) = X^=o x (j)y(j) > wc 
have: 



z = J2 Z ^ = + i^y))^ 1 = E _ * wi + ( x -y)EX 

j=0 i=0 i=0 i=0 

m 

(4.8) = ^-ftw* (mod /(w)). 

i=0 

Therefore the first term on the right-hand side of (4.7) vanishes, so that one need 
not even compute it. Thus using the arithmetic in TV but implicitly working modulo 
f(u>) is more efficient than performing arithmetic in TV alone. This is somewhat 
fortuitous as it means that while the multiply operation in (4.8) is not correct in 
TV , nevertheless, when one maps back to F p [u)]/(f(u)W p [u)]), it is correct. 

4.5. Application to GRPs. Since equation (4.5) is an algebraic identity, it is easy 
to see that exactly the same argument applies in the context of GRPs, and we can 
replace the formulae (4.1) with the CVMA formulae (4.2). Since reduction in the 
ring TZ = Z[T]/ (T m+1 — 1)Z[T] has a particularly nice form for GRPs, we choose 
to use the full basis for TZ and hence do not reduce explicitly modulo & m +i(T) to 
obtain a smaller basis. This also has the effect of eliminating the need to perform the 
addition of qo (or q m , or whichever term one wants to eliminate when one reduces 
modulo $ m+ i(T)), simplifying the multiplication algorithm further. Absorbing the 
minus sign into the qi, Algorithm 1 details how to multiply residue representatives. 

Remark 4.1. Observe that each component of z may be computed entirely inde- 
pendently of the others. Hence using m + 1 processors rather than 1, it would 
be possible to speed up the execution time of Algorithm 1 by a factor of m + 1, 
making it particularly suitable for hardware implementation. In §5 we consider the 
parallelisation of our reduction algorithms as well. 
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Algorithm 1: GRP MULTIPLICATION 



INPUT: x=[x m ,...,x ],y=[y m ,...,y ] e Z m+1 
OUTPUT : z = [z m , . . . , zo] G Z m+1 
where z=* m+1 (t) x-y 

1 . For i — m to do : 

2- * <- ET=i( x a-j) - ■ (va+j) - va-j)) 

3 . Return z 



4.6. Cost comparison. We here use a simple cost model to provide a measure of 
the potential performance improvement achieved by using Algorithm 1, rather than 
schoolbook multiplication of residues. We assume the inputs to the multiplication 
algorithm have coefficients bounded by b q , i.e., they each consist of q words. Let 
M(q, q) be the cost of a g-word by g-word schoolbook multiplication, and let A(q, q) 
be the cost of an ition of two g-word values. We assume that A(2q, 2q) = 2A(q, q) 
and that there is no overflow beyond 2q words in the resulting vector components, 
which one can ensure by selecting appropriate GRPs, see §7. The cost of the 
multiplication using each method is as follows. 



4.6.1. GRP schoolbook multiplication. Working modulo T m H \-T + 1 and using 

a basis consisting of m terms only, the number of coefficient multiplications is m 2 , 
while the number of double- length additions is also m 2 . Hence the total cost is 
simply 

m 2 • M(q, q) + 2m 2 ■ A(q, q). 

Note that computing the convolution (4.1) costs 

(to + l) 2 • M(q, q) + 2m(m + 1) • A(q, q), 

which is costlier since it requires embedding into 1Z, which introduces some redun- 
dancy. 

4.6.2. CVMA formulae. For each zi computing each term in the sum costs M(q, q)+ 
2A(q, q), and so computing all these terms costs ^ • (M(q, q) + 2A(q, q)). The cost 
of adding these is (^ - l)A(2q, 2q) = (m — 2) • A(q, q). For all the m + 1 terms Zi 
the total cost is therefore 

m(m + 1) , ,, s 9 „ . . , x 
'- ■ M(q, q) + 2(m 2 - 1) • A(q, q). 

Therefore by using the CVMA formulae, we reduce not only the number of mul- 
tiplications, but also the number of additions (by 2), contrary to the case of field 
extensions, for which the CVMA formulae increases the number of additions by 
nearly 50%. We have thus found an analogue of the asymptotic cyclic versus lin- 
ear convolution speed-up for multiplication modulo Mersenne numbers (see Eq. 
(6.1) of [16], for example) at small bitlcngths for which schoolbook multiplication 
is optimal, for GRPs. 
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5. GRP Reduction 



In this section we detail reduction algorithms for two types of GRPs. The first, 
Algorithm 2, assumes only that t is even, which provides the minimum possible 
restriction on the form of the resulting GRPs for any given bitlcngth. All such 
GRPs can therefore be implemented with code parametrised by the single variable 
t, which may be beneficial for some applications. Supposing that R = b q > t, then 
as with Montgomery reduction, it is more efficient to reduce components not by R 
as in (3.4) and (3.5), but by b sequentially q times. In Algorithm 2 each reduction 
therefore reduces the input's components by approximately log 2 b bits. 

The second reduction method as detailed in Algorithm 3 is a specialisation of 
Algorithm 2. It assumes that t = mod 2 l for some I > 1, and each application 
of the reduction function reduces the input's components by approximately I bits. 
Algorithm 3 is potentially far more efficient than Algorithm 2, depending on the 
form of t. Ideally one should choose a t for which I > (log 2 i)/2 so that two 
applications of the reduction function are sufficient in order to produce components 
of the desired size, which is minimal. In general for other values of I a larger 
number of reductions may be needed, which we consider in §7. In constrast to 
Algorithm 2, which is designed for generality, Algorithm 3 is geared towards high- 
speed reduction. The trade-off arising here is that there will naturally be far fewer 
GRPs of this restricted form. We also present a modification of Algorithm 3, which 
is slightly more efficient in practice, in Algorithm 4. 

5.1. GRP reduction: t even. Following §3.5, in equation (3.4) we need the 
matrix — L^ 1 : 



(5.1) 



—L~ 



1 



t m+l _ I 



t m-l 



1 

t 



t 6 t 2 
t 4 t 3 
t 5 t 4 



t 



t 



The form of L and — L^ 1 allows one to compute u = — L _1 • z mod b and L ■ u, 
computed in equation (3.5), very efficiently. Since t is even, the following vector 
may be computed. Let t[0] be the least significant digit of t, written in base 6, and 
let 

V = t[o]"+i-i [t[or '* [0]m ~ 1 ' • ' • ' * [0] ' 1] mod L 

Algorithm 2 details how to reduce a given an input vector z by 6, modulo t m+1 — 1, 
given the precomputed vector V. Observe that Algorithm 2 greatly simplifies the 
reduction algorithm originally given in [15]. This is possible since for t m+1 — 1 one 
can interleave the computation of the vectors u and w defined in (3.4) and (3.5) 
respectively. This has two benefits. First, as one computes each component of w 
sequentially, one need only store a single component of u, rather than m+1. Second, 



since when one computes L ■ u one needs to compute t ■ u 



<i+l) 



for i 



,0 (in 



line 3), one obtains t[0] • itj (the first term on right-hand side of line 4) for free 
by computing the full product t ■ tt(»+i) first. One therefore avoids recomputing the 
least significant digit of t ■ in each loop iteration. In fact one can do this for 
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Algorithm 2: rcdl b (z) 
INPUT: z = [z m ,...,zo] 

OUTPUT: redb(z) where red&(z) = t m+i_i z • b^ 1 

1 . Set u 4- (J2Zo V i ■ Z M) mod b 
1 . For i — m to do : 

3. v l 4-t-u {l+1) 

4. 4- (vi[0] - Zi[0]) mod b 

5. tOj «- (zi + Ui - Vi)/b 

6. Return w 



any polynomial t m+1 — c, with exactly the same algorithm, the only difference being 
in the definition of V, where t m+ — c becomes the denominator. For polynomials 
with other non-zero coefficients, this does not seem possible, and so Algorithm 2 
seems likely to be the most efficient Chung-Hasan reduction possible with this 
minimal restriction on the form of t. 

It is straightforward to verify that Algorithm 2 correctly produces an output 
vector in the correct congruency class, via a sequence of simple transformations 
of [15, Algorithm 3]. However we do not do so here, since we are mainly interested 
in the more efficient Algorithms 3 and 4. 

Remark 5.1. Note that in the final loop iteration, uq from line 1 is recomputed, 
which is therefore unnecessary. However, we chose to write the algorithm in this 
form to emphasise its cyclic structure. Indeed, there is no need to compute uq first; 
if one cyclically rotates V by j places to the left, then the vector w to be added 
to z in (3.5) is rotated j places to the left also. One can therefore compute each 
coefficient of redl^(z) independently of the others using a rotated definition for V 
(or equivalently by rotating the input z ). This demonstrates that a parallelised 
version of the reduction algorithm with m+1 processors is feasible. However, as each 
processor requires the least significant word of each component of z, this necessitates 
a synchronised broadcast before each invocation of the reduction function. In this 
scenario the reduction time would be proportional to the number of such broadcasts 
and reductions required, independently of m + 1. 

5.2. GRP reduction: t = mod 2 l . In the ideal case that t — 2 l , we see that such 
a GRP would be a GMN. In this case, one can use the reduction method detailed in 
§3.4 without resorting to using its Montgomery version at all. Multiplication would 
also be faster thanks to Nogami's formulae. Unfortunately, such GRPs seem to be 
very rare. It is easy to show that if t = 2 l with I > 1 and $ m+ i(i) is prime, then 
I = m+1. Testing the first few cases, we find prime GRPs for I = 2, 3, 7, 59 but 
no others for prime I < 400. Note that these primes contradict Dubner's assertion 
that no such GRPs exist [19, §2]. Since for I = 59 the corresponding GRP has 3422 
bits, this is already out of our target range for ECC, so we need not worry about 
such GRPs. 

Hoping not to cause confusion, in this subsection we now let b = 2 l where I is 
not necessarily and usually not the word size of the target architecture. We denote 
the cofactor of b in t by c (which by the above discussion we assume is > 1), so 
that t = b • c. Algorithm 3 details how to reduce a given an input vector z by 6, 
modulo t m+1 - 1. 
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Algorithm 3: red2 b (z) 



INPUT : z = [z m , . . . , z ] 6 Z m+1 

OUTPUT: red 6 (z) where rcd 6 (z) =^+1^ z • b^ 1 

1 . For i — m to do : 

2 . Wi (zi + (—Zi mod b))/b — c ■ (— zi i+ \\ mod b) 

3 . Return w 



A simple proof of correctness of Algorithm 3 comes from the specialisation of 
Algorithm 2. Since t = mod b, writing t in base b, the vector V becomes 

V d = [0,...,0,-l] mod b. 
Hence for line 1 of Algorithm 2 we have 

Mo < zq[0} mod b. 

Since in line 4 of Algorithm 2, we have Vi = mod b, we deduce that Uj = 
— Zi mod b, and hence we can eliminate Uj altogether. Each loop iteration then 
simplifies to 

«- i • mod 6) 

(5.2) Wi (zi + (— Zi mod 6) — Wj)/6 

Upon expanding (5.2), we obtain 

Wi (zi + (— Zi mod b))/b — t ■ (— 2(i+i) mod 6)/6 

= (zi + (— Zi mod &))/& — c • (— mod 6), 

as required. However since we did not provide a proof of correctness of Algorithm 2, 
we also give a direct proof as follows. Observe that modulo t m+1 — 1, we have 

m 

V> _1 (w) = ^Wjf 

i=0 
m 

= ^^[(-Zi + (~ z i mod 6))/6 — c • (— z^ i+1 ^ mod b)\t l 

i=0 

mm m 

= + mod 6))/6)t < - £((-*< i+ i) mod 6)/6)t* + 1 

m 

= z ^/ b ( mod * m+1 - *) 

i=0 

as required. In terms of operations that may be performed very efficiently, we alter 
Algorithm 3 slightly to give Algorithm 4, which has virtually the same proof of 
correctness as the one just given. 
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Algorithm 4: red3b(z) 



INPUT: z = [z m ,...,zo] £Z m+1 

OUTPUT: red 6 (z) where red b (z) ^ t m+i_i z • b^ 1 

1 . For i — m to do : 

2. toj <- Zi/b + c • mod 6) 

3. Return w 



Note that the first term in line 2 of Algorithm 3 has been replaced by a division 
by 6, which can be effected as a simple shift, while now the second term needs the 
positive residue modulo b, which can be extracted more efficiently Hence Algo- 
rithm 4 is the one we use. By our previous discussion, c necessarily has Hamming 
weight at least two for GRPs in our desired range. By using c that have very low 
Hamming weight, one can effect the multiplication by c by shifts and adds, rather 
than a multiply (or imulq) instruction. Hence for such GRPs, assuming only two 
invocations of Algorithm 4 are needed, reduction will be extremely efficient. 

Remark 5.2. Regarding parallelisation, observe that for m+ 1 processors, only the 
least significant word of Z(i+i) is passed to processor i, thus reducing the broadcast 
requirement in comparison with Algorithm 2. 



6. GRP Residue Representation 

So far in our treatment of both multiplication and reduction, for the sake of 
generality we have assumed arbitrary precision when representing GRP residues 
in Z m+1 . In this section we specialise to fixed precision and develop a residue 
representation that ensures that our chosen algorithms are efficient. Our deci- 
sions are informed purely by our chosen multiplication and reduction algorithms 
- Algorithms 1 and 4 — which we believe offer the best performance for GRPs 
for the relatively small bitlcngths which are relevant to ECC. In other scenarios 
or if considering asymptotic performance, one would need to redesign the residue 
representation and multiplication algorithm accordingly. 

For x £ {0, . . . , t m+1 — 1} we write x = [x m , . . . , x ] for its base-i expansion, 
i.e., x — J27Lo x it l - The base-t representation has positive coefficients, however 
Algorithm 1 makes use of negative coefficients, so we prefer to incorporate these. We 
therefore replace the mod function in the conversion with mods, the least absolute 
residue function, to obtain a residue in the interval [— i/2,i/2 — 1]: 

, . _ J x mod t if (x mod t) < t/2, 

^ ' \ x mod t — t otherwise. 

Using this function, Algorithm 5 converts residues modulo t m+1 — 1 into the required 
form [15, Algorithm 1]. 
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Algorithm 5: BASE-i CONVERSION i\> 



INPUT: An integer < x < t m+1 - 1 

OUTPUT: x=[x m ,x ] such that \xi\ < t/2 

and YJiL o Xit* = x (mod t m+1 - 1) 



1. For i from to m do: 

2 . Xi 4— x mods t 

3. x <— (x — Xi)/t 

4 . x x + x 

5 . Return x = [x m , . . . , Xq] 



The reason for line 4 in Algorithm 5 is to reduce modulo t m+1 — 1 the coefficient 
of t m+1 possibly arising in the expansion. Note that in this addition, x G {0, 1}, 
and hence \xi\ < t/2 for each < i < m. By construction, we in fact have 
— 1/2 < Xi < t/2 for 1 < i < m while only xq can attain the upper bound of t/2. 
There are therefore t m (t + 1) representatives in this format, thus introducing a very 
small additional redundancy. Letting k = |~log 2 i], if we assume t < 2 k — 2, so 
that [—t/2, t/2] C [— 2 k /2, 2 k /2 — 1], then the coefficients as computed above can 
be represented in two's complement in k bits. In terms of efficiency, Algorithm 5 
contains divisions by t, which requires not only time, but also space, which on some 
platforms may be at a premium. Writing t = 2 l ■ c as in §5.2, then if the cofactor 
c = 2 k ~ l — c' with c' very small, then division by t consists of a shift right by I bits 
and a division by c, which can be performed efficiently using Algorithm 1 of [13]. 

Following this conversion, it might seem desirable to define vectors whose com- 
ponents are in [— 2 fe /2, 2 k /2 — 1] to be reduced, or canonical residue representatives. 
However, for efficiency purposes it is preferable to have a reduction function which, 
when performed sufficiently many times, outputs an element for which one docs 
not have to perform any modular additions or subtractions to make reduced, as 
this eliminates data-dependent branching. A control-flow invariant reduction func- 
tion is also essential to defend against side-channel attacks, see §9. To obtain 
such a function, observe that the second term in line 2 of Algorithm 4, namely 
c • (2(i+i> mod 6), is positive, and in the worst case is k bits long. The first term, 
Zi/b, is clearly I = \og 2 b bits shorter than Zi. Since one adds these the resulting 
value may be k + 1 bits, or larger, depending on the initial length of the inputs' 
components. Furthermore, since we wish to allow negative components, in two's 
complement the output requires a further bit, giving a minimal requirement of A: + 2 
bits. We therefore choose not to use minimally reduced elements as coset represen- 
tatives in Z m+1 / ~, as output by Algorithm 5, but slightly larger elements, which 
we now define. 

Definition 6.1. We define the following set of elements of Z m+1 to be reduced: 

(6.1) I m+1 = {[x m , . . . , x ] e Z m+1 I -2 fe+1 < x t < 2 k+1 }. 

Note that the redundancy inherent in this representation depends on how close 
t is to 2 k+2 . For a modular multiplication, we assume that the inputs are reduced. 
We must therefore ensure that the output is reduced also. This naturally leads one 
to consider I/O stability, as we do in §7. 
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Once we have a reduced representative x = i/)(x) we also need to convert to the 
Montgomery domain. While one can do this in Z/(i m+1 — 1)Z before applying ip, it 
is more convenient to do so in Z m+1 / ~. Assuming q reductions by 6 are sufficient 
to ensure I/O modular multiplication stability, we precompute tp{b 2q mod <& TO+ i(t)) 
and then using Algorithms 1 and 4 compute 

x • ^(& 2 « mod * m+ i (*))/&* = $m+l(t) ${x ■ b"). 

Similarly, to get back from the Montgomery domain, again using Algorithms 1 
and 4, we compute 

With regard to mapping back from x = [x m , . . . ,x ] G to canonical residues 

in Z/$ m+ i(f)Z, one has 

m m — 1 

^2 x *t l =^2( x *~ x ™)t l (mod $ m+ i(i)), 

which can be computed efficiently by first using Horner's rule and then mapped 
to {0, . . . , $ m+ i(t) — 1} by repeated additions or subtractions. In terms of opera- 
tions required for ECC, we assume that the conversions are one-time computations 
only, with all other operations taking place in the (Montgomery) Chung-Hasan 
representation. 

7. Modular Multiplication Stability 

In this section we analyse Algorithms 1 and 4 with a view to ensuring I/O 
stability for modular multiplication. We assume the following: b = 2 l ,t = c-b where 
c < 2 k ~ l (and hence t < 2 k — 2), and that reduced elements have the form (6.1). 
Input elements therefore have components in I = [— 2 fe+1 , 2 k+1 — 1], and these are 
representable in k + 2 bits in two's complement. For simplicity and in order for our 
analysis to be as general as possible, we use the term single precision to mean a word 
base large enough to contain t — even if this in fact requires multiprecision on a 
given architecture — and double precision to mean twice this size. We assume that 
for this single precision word size w, the components of z output by Algorithm 1 
are double precision. In practice one prefers to specialise to actual single precision 
ton a given architecture, since this obviates the need for multiprecision arithmetic; 
utilising the native double precision multipliers that most CPUs possess is more 
efficient, and reduction is also faster for smaller t since fewer iterations need be 
performed. We note that in constrained environments however, multiprecision may 
however be unavoidable. 

During the multiplication, terms of the form Xi — Xj are computed, which are 
bounded by 

_ 2 k+2 + 1 < x ._ Xj < 2 k+2 _ 1; 

and which therefore fit into k + 3 bits in two's complement. The product of two 
such elements is performed, giving a result 

_ 2 2k+i + 2 k+3 _ 1 < ^ _ Xj} . (% _ yi) < 2 2fe+4 _ 2 fe+3 + lf 

which fits into 2A: + 5 bits in two's complement. One then adds m/2 of these terms, 
giving a possible expansion of up to [log 2 to/2] bits, which must be double precision. 
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m+ 1 




I 


c < 


[log 2 p] 


3 


61 


33 


2 28 


122 


5 


61 


34 


2 27 


244 


7 


60 


34 


2 26 


360 


11 


60 


34 


2 26 


600 


13 


60 


34 


2 26 


720 


17 


60 


34 


2 26 


960 



Table 2. Stable parameters: w = 64, q = 3 



m + 1 




I 


c < 


[log 2 P] 


3 


61 


23 


2 y8 


122 


5 


61 


23 


2 38 


244 


7 


60 


23 


2 37 


360 


11 


60 


23 


2 37 


600 


13 


60 


23 


2 37 


720 


17 


60 


23 


2 37 


960 



We therefore have a constraint on the size of t (in addition to the constraint t < 
2 k — 2) in terms of to: 

(7.1) [log 2 (m/2)] +2fc + 5 < 2w 

This inequality determines a constraint on the size of t, given to and w. Assum- 
ing (7.1) is satisfied, one then needs to find the minimum value of b — 2 l such that 
the result of the multiplication step, when reduced by b a specified number of times, 
say q, outputs a reduced element. This needs to be done for each (to, k) found in 
the procedure above. Any power of 2 larger than this minimum will obviously 
be satisfactory also, however minimising b maximises the set of prime-producing 
cofactors c, which as stated in §5 may be useful in some scenarios. 

In §6, we showed that one application of Algorithm 4 shortened an input's com- 
ponents by l—l bits, unless the components were already shorter than (fc+2) + (Z — 1) 
bits. Therefore stipulating that q reductions suffice to produce a reduced output, 
we obtain a bound on I in the following manner. Let 

h = [log 2 (m/2)] + 2k + 5 

Then after one reduction, the maximum length of a component is h — l+l. Similarly 
after q reductions, the maximum length is max{h — q(l — 1), k + 2}, and we need 
this to be at most k + 2. Hence our desired condition is 

h-q(l - 1) < k + 2 

Solving for I, we have 

(7.2) ^ 1+ [Wm/2)1 + fc + 3 

Using these inequalities it is an easy matter to generate triples (to+ 1, k, V) which 
ensure multiplication stability for any w and q. For example, for w — 64, Tables 1 
and 2 give sets of stable parameters for q = 2 and q = 3 respectively. 
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The final column gives the maximum bitlength of a GRP that can be represented 
with those parameters, though of course by using smaller c one can opt for smaller 
primes, and the corresponding minimum value of I reduces according to (7.2). To 
generate suitable GRPs, a simple linear search over the values of c of the desired 
size is sufficient, checking whether or not $ m+ i(2 z • c) is prime, see §10. 

8. Full GRP Modular Multiplication 

For completeness we now piece together the parts treated thus far into a full 
modular multiplication algorithm, where in Algorithm 6 we assume q reductions by 
b are required for I/O stability and in line 4 either Algorithm 2 or Algorithm 4 is 
used according to the form of b. 



Algorithm 6: GRP MODMUL 



INPUT: x= [x m ,...,xo],y = [y m ,...,yo] e F n+1 

OUTPUT: z = [z m , ...,Zo] 6 I m+1 where z =* m+1 (t) x • y • b~ q 

1. For i~m to do: 

2- * <- E^i (*<*_.,•> -*<i +j >) • (y<i +j > -va-j)) 

3. For k from to q—1 do: 

4. z^red;,(z) 

5 . Return z 



Should t be multiprecision on a particular architecture, then as with Mont- 
gomery arithmetic it may be more efficient to use an interleaved multiplication 
and reduction algorithm, as we detail in Algorithm 7. Here one needs b to be the 
word base of the underlying architecture and so in line 6, if t = (mod b) we 
use Algorithm 4, otherwise we use Algorithm 2. For x = [x m , . . . ,Xo] we write 
x t = Xi [0] + Xi[l]b + ■ ■ ■ + Xi[q - 



Algorithm 7: GRP MODMUL (interleaved) 



INPUT: x= [x m ,...,x ],y = [y m ,...,y ] e I m+1 

OUTPUT: z = [z m , . . . , z ] G I m+1 where z =$ m+l( t) x • y • b~« 

1. z^[0,...,0] 

2. For k = to q - 1 do: 

3 . For i = m to do : 

5 . z z + w 

6. z red;,(z) 

7 . Return z 
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To verify the correctness of Algorithm 7, observe that for each of the m + 1 
components of z, after the last iteration of the outer loop we have: 

m/2 q-1 

* - E(E(»<i-i)[*]-»(Hi>W)/6«- fc )-(»<i + i>-»<4-i>) 

3 = 1 k=0 
m/2 

3 = 1 

Hence when taken modulo <& m+ i(t), we see that z is congruent to: 

m m m/2 

-•»+!(*) E(E^(i-i> -^(i+^V^-^i+i) -va-3))) 

i=0 i=0 j=l 

i=0 i=l 

-*m+iW x-y-fe -9 , 

as required. As with ordinary Montgomery arithmetic, there are many possible 
ways to perform the interleaving, see [39] for example. 

9. Other arithmetic and side-channel secure ECC 

In addition to modular multiplication, one also needs to perform other arithmetic 
operations when implementing ECC point multiplication. In this section we detail 
how to perform these using our representation and briefly explain how it enables 
point multiplication to be made immune to various side-channel attacks. 

9.1. Other arithmetic operations. 

9.1.1. Addition/ subtraction. To perform an addition or subtraction of two reduced 
elements x, y, we compute the following: 

x ± y = [x m ±y m ,...,x ± y ] ■ 

Note that the bounds on each of these components is [— 2 fe+2 , 2 k+2 — 2], which are 
therefore not necessarily reduced. One could reduce the resulting element using 
the specialisation to GRPs of [15, Algorithm 5], which shows how to do this for a 
general LWPFI. Chung and Hasan refer to this process as short coefficient reduction 
(SCR), as opposed to full modular reduction. However, for ECC operations it is 
faster (and more secure) to simply ignore this expansion and rely on a later modular 
multiplication to perform the reduction, as is required when computing a point 
addition or doubling, see [5, 6] and §9.2. 

9.1.2. Squaring. When t is single precision, the CVMA formulae do not have any 
common subexpressions, as arises for ordinary integer residue squaring. In this 
case GRP squaring is performed using Algorithm 6. If t is multiprccision, then the 
components of a product x y are computed as a sum of integer squares. In this case, 
one can eliminate common subexpressions to improve efficiency by nearly a factor 
of two (in the multiplication step). On the other hand, when using Algorithm 7 and 
its variants it may be difficult to eliminate common subexpressions efficiently [39] . 



24 



ROBERT GRANGER AND ANDREW MOSS 



9.1.3. Inversion and equality check. Inversion seems difficult to perform efficiently 
in the GRP representation. If t were prime then it would be possible to use an 
analogue of the inversion/division algorithm of [52], exploiting the cyclicity t m+1 = 
1 (mod t m+1 — 1). However, for our GRPs t is even and greater than 2. One can 
therefore opt to map back to Z/<& m+ i(£)Z, remaining in the Montgomery domain, 
and perform inversion using the binary extended Euclidean algorithm (see [38] , for 
example) and modular multplying by the precomputed value tp(b 3 mod $ m+ i(t)). 
Alternatively, for data-independent inversion, one can simply power by $„ l+ i(i) — 
2, as do the authors of [6]. Using projective coordinates can obviate the need 
for inversions altogether, however for many protocols inversion is unavoidable and 
when it is avoidable, in some scenarios such representations of points should be 
randomised after a point multiplication [50] . 

Since our representation possesses redundancy, equality checking is naturally 
problematic. We therefore opt to map back to Z/$ m+ i(i)Z to check equality there 
— as for inversion — while remaining in the Montgomery domain. For ECC equal- 
ity checking is usually a one-time computation per coordinate, and so again this 
operation does not greatly impinge upon efficiency. 

9.2. Side-channel secure ECC. As we demonstrated in §7, by choosing t, I 
and m + 1 carefully, one can avoid the need to compute any final additions or 
subtractions when performing a modular reduction. This is an analogue to various 
results for ordinary Montgomery arithmetic [62, 63, 29]. The lack of a conditional 
addition/subtraction averts threats such as [64, 55], the latter of which applies 
directly to the NIST GMNs. Our modular multiplication algorithm is thus control- 
flow invariant with no data-dependent operations, making it immune to timing 
attacks [40] and simple power analysis (SPA). 

In addition to making modular multiplications and squarings immune to timing 
attacks and SPA, one can also ensure that the computation of an entire elliptic 
curve point addition or doubling is also immune. To do so, one chooses a GRP 
with t divisible by a sufficiently high power of 2, so that during the course of an 
elliptic curve point operation, even if one ignores the coefficient expansion caused 
by additions/subtractions, these do not overflow and the modular reductions ensure 
the outputs are fully reduced elements. Note that this requires b — 2 l \ t to be a few 
bits longer than the minimum /-values listed in Tables 1 and 2: for reasons of space 
we do not include the analysis here. By doing so, a point addition or doubling be- 
comes an atomic operation, where the sequence of arithmetic operations is entirely 
data-independent. In this case one only needs point-multiplication-level defences 
against timing attacks and SPA, such as the double-and-add-always algorithm due 
to Coron [18], or the use of Edwards curves, for which the addition formula can 
also be used for doubling [7]. Hence, ECC over GRPs may be straight-line coded, 
which is beneficial for both efficiency and security. 

Lastly, our representation can also be made immune to differential power analysis 
(DPA) [41]. Observe that the embedding of Z/$ m+ i(i)Z into Z/(t m+1 - 1)Z can be 
randomised by adding to it a random multiple r -$ m +i(t) for r s {0, . . . , (t — 1) — 1}. 
While our embedding is an example of 'operand scaling' [61, 52] which is used for 
faster reduction, the addition of a multiple of the modulus within a redundant scaled 
representation also acts as a countermeasure to DPA — such as Goubin's attack [28] 
on the randomised projective coordinates defence of Coron [18] — as shown by 
Smart, Oswald and Page [58]. In particular, for multiprecision integer residues the 
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authors show that this countermeasure thwarts DPA whenever the scaling factor is 
longer than the longest string of ones or zeros in the binary expansion of the initial 
modulus. For the NIST GMNs, this countermeasure requires a large scaling factor, 
making the defence inefficient and nullifying the benefits of using these moduli. 
Applying the Smart-Oswald-Page rationale to GRPs, one sees that the scaling 
factor is t — 1, while the longest string of ones or zeros in the binary expansion 
of 4> m+ i(t) is [~logt] — 1. Since GRPs already use the larger ring, we acquire this 
defence for almost negligible cost. In particular the addition of a random multiple 
r of $ m+ i(i) to an element x has the form [x m + r, x m -\ + r, . . . , xq + r], which 
only requires m + 1 additions. Since DPA depends on the ability of an attacker 
to predict a specific bit in the representation of a given field element (other than 
the upper excess zero bits in each coefficient of GRP residues, which are the same 
for every field element), if the representation of points is randomised in this way 
prior to every point multiplication, or even every modular multiplication, then DPA 
should not be feasible. 

10. GRP Parameters 

In this section we provide empirical data regarding the abundancy of GRPs at 
various bitlengths relevant to ECC. We also specify parameters that are particularly 
suitable for efficient implementation. 

10.1. Estimating the number of GRP parameters. As we saw from Tables 1 
and 2, for a given prime m + 1 and word size w, there is an upper bound on the 
length of a GRP that may be represented. Table 3 contains estimates (or exact 
counts) for the number of GRPs which are in accordance with the GRP field and 
residue representation set out in this work, for a word size w = 64 and where q = 2 
reductions suffice to ensure I/O modular multiplication stability. The data was 
obtained as follows. 

For a desired GRP p of bitlength [logp], Table 1 gives the minimum value 
of prime m + 1 which is adequate to represent GRPs of this size. The inequal- 
ity (7.1) gives k max which is the maximum bitlength of t that is representablc, 
while (7.2) gives the minimum value I required in order for t = 2 l ■ c to be I/O 
stable. We estimate t max simply as 2^ logp ^ m , which implies a maximum value for 
c of 2P°gi>l/™-*mi». Similarly for p of this precise bitlength, we estimate the mini- 
mum value of c as 2^ losp ^~ ls> / m ~ lmin . We denote this interval by 1(c). To estimate 
P(prime), which is the probability that a given generalised repunit in our form is 
a GRP, we performed a linear search over c G 7(c), counting the first 1, 000 primes 
and simply dividing by the length of the search. The estimated total number of 
GRPs satisfying our requirement that q = 2 is then given by \I(c)\ ■ P(prime). 

For each of m + 1 = 5, 7 and 11, Table 3 contains estimated counts for the largest 
GRPs representable. It also contains estimates (or exact counts) for the number 
of GRPs at the NIST GMN sizes 224, 256 and 384. We also consider bitlength 512 
rather than 521, since this conjecturally gives 256-bit security, with the larger prime 
2 521 — 1 being nominated purely for fast reduction. Observe that the number of 
available GRPs for a given m + 1 decreases as the size of p, and hence c decreases. 
The number available for bitlengths 383 and 384 is particularly low. However, 
should this be a concern for a particular application, one can see from Table 2 that 
by moving to GRPs for which 3 reductions suffices, \I(c)\ becomes much larger 
(3.71 x 10 5 ) and our estimate of the number of GRPs becomes over 5, 000. On the 
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Table 3. Estimated GRP counts for w = 64, q = 2 
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other hand, since 384 is not too far beyond the upper bound for the size of GRP 
representable by m + 1 = 7, it may be preferable to trade 12-bits of security for 
much improved performance, see §11. Similarly, 

10.2. Hamming weight 2 parameters. As we showed in §5.2, there are no suit- 
able GRPs in the ECC range for which t — 2 l . Hence the next best type of GRP 
parameter t will have Hamming weight equal to 2, where we allow c to have the 
form 2 C + 1 as well as 2 C — 1 when there is sufficient slack in the representation, 
since subtractions cost the same as additions. We list these GRPs in Table 4. The 
final column indicates whether or not the given GRP allows for atomic side-channel 
secure point additions and doublings, as per §9.2. Note that for m + 1 = 5 and 
w = 64 we can not represent GRPs any larger than 244-bits, and are thus short 
of the conjectured 128-bit ECC security level of 256-bits. One can therefore either 
move up to m + 1 = 7, which can represent GRPs of up to 360-bits, or one can 
opt to reduce security by a few bits, for better performance. Indeed, in recent 
work Kasper argues that the NIST GMN prime P-224 = 2 224 - 2 96 + 1 offers a 
satisfactory trade-off between security and efficiency, when used as the basis of the 
elliptic curve Diffie-Hellman (ECDH) key exchange in the Transport Layer Security 
(TLS) protocol [33]. Bernstein has also implemented arithmetic mod P-224 [4]. Yet 
another possibilty at this security level are the GFN primes $s(2 41 ■ (2 15 — 1)) and 
<I>8 (2 50 • (2 6 — 1)), both of which have bitlcngth 224, but experiments with such 
GFNs have not yet been carried out. 

11. Implementation and Results 

In this section we provide details of our proof-of-concept implementation and 
our results. We consider field multiplications only as this is the bottleneck for ECC 
point multiplication and hence gives an accurate indication of performance. 

In terms of performance, the fastest implementations of ECC in the literature all 
feature cycle counts for 256-bit ECC point multiplication [5, 26, 32, 46, 25, 6], except 
for Rasper's P-224 implementation [37], with [5, 6, 37] each being side-channel 
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Table 4. Approximately NIST-size fast GRPs for w — 64, q = 2 
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secure. As it is difficult to get a fair comparison between our implementation and 
these, we opt to compare our modular multiplication performance with the mpF g 
benchmarking system due to Gaudry and Thome [26]. This has been ported to 
OS-X 10.5.8 with minor changes and executed on a platform using an Intel Core 
2 Duo at 2.2Ghz. As stated in [6], to date mpF 9 gives only the fourth fastest 
implementation of ECDH, based on Bernstein's curve25519, which utilises a non- 
standard representation of residues mod 2 255 — 19 and exploits the floating-point 
unit of specific instruction-set architectures to great effect. However, by comparing 
the basic multiplication cost on the target architecture, one can obtain a crude 
estimate of the relative performance of our arithmetic with that of curve25519. 

Our implementation consists of two inline assembly operations targeted at the 
Core 2 processor. One accumulates the innermost sum of line 2 of Algorithm 6, 
while the other performs a single instance of the reduction operation in line 4 
of Algorithm 6. Both use the 64-bit operations available on the Core 2 and the 
extended register set available in x86_64. These assembly operations both use a 
mere 4 of the 15 available in the x86-64 instruction set. This allows one to rely on 
normal C code to arrange these macros, and to handle data-storage. As a result the 
gec compiler can generate all of the intermediate memory access instructions and 
schedule the usage of the other 11 registers available. This means that the same 
code can be reused for any field supported by Algorithm 6 — the only changes 
required are the parameter definitions. To generate a particular instance of the 
family of algorithms we use a simple wrapper written in Python that arranges the 
sequence of these operations required for the particular parameter choice of m + 1 
and t. 

To emphasise the relative simplicity of our implementation, we use only 64-bit 
scalar operations on the processor, and allow the compiler to schedule most of the 
output instructions. As a result we reach a throughput of slightly less than one 
operation per cycle. In comparison the mpF 9 implementation of curve25519 uses 
SSE2 to reach a throughput of almost two operations per cycle (the theoretical 
maximum on the architecture). Although our implementation is less efficient (be- 
cause we have spent less programmer time on the machine-dependent optimisation) 
the performance achieved is still higher. Scheduling a lower-level implementation 
on the processor would be an interesting challenge. 
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Table 5. mpF g cycle counts for curve25519 and Montgomery arithmetic 
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M. 
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Table 6. Cycle counts for GRP arithmetic 



Parameters 


Max size (bits) 


ModMul (cycles) 


m+ 1 = 5, HW(c) = 2 


244 


96 


m + 1 = 5, general c 


244 


112 


m+ 1 = 7, HW{c) = 2 


360 


165 


m + 1 = 7, general c 


360 


182 


m + 1 = 11, general c 


600 


340 



As explained in §5, within the reduction algorithm we have a trade-off between 
the number of GRPs available and performance. If one opts for a generic value 
of c many GRPs are available, but the reduction involves a full imulq instruction 
with relatively high latency. If we specialise our choice of c to very low Hamming 
weight then we can replace this instruction with a combination of shift and add 
instructions to improve performance. We have measured the performance of both 
implementations. To ensure a fair comparison we have merged our code into mpF 9 
so that all algorithms are being tested with the same timing code. This timer 
executes 10 6 operations in the field, measuring the elapsed time. The reported 
figures are the mean execution time for the operation. Table 5 contains cycle 
counts for Montgomery arithmetic at various bitlengths, as well as the curve25519 
modular multiplication cycle count. Table 6 contains our results for GRP modular 
multiplication. 

As stated in §10.1, the closest size of field to curve25519 that we can implement 
using m + 1 = 5 is only 244-bits. This small reduction in field size is compen- 
sated by an increase in performance, requiring only 80% of the curve25519 cycles 
per multiplication. Using the specialised reduction function for the 243-bit GRP 
$ 5 (2 59 • (2 2 — 1)), this figure improves to 69%. Since the results for the first line 
of Table 6 apply also to Hamming weight 2 GRPs smaller than 2 243 , we obtain the 
same modular multiplication performance, while utilising the acquired slack in the 
representation to ensure atomic point doublings/additions as per §9.2, in particular 
for the 228-bit GRP $ 5 (2 54 • (2 3 - 1)). At bitlength 512 with general c, compared 
to Montgomery multiplication, GRP multiplication costs only 35% as many cy- 
cles. At bitlength 600, this proportion would naturally be even smaller, however at 
this size Karatsuba multiplication may be faster than schoolbook arithmetic. We 
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thus expect that point multiplications at 224-bits and 512-bits using GRPs to be 
competitive with the state-of-the-art in the literature. 

We freely admit that our proof-of concept implementation has not been opti- 
mised, and therefore believe that one could obtain significantly better performance 
figures. By comparing our arithmetic with the modular multiplication used in [6], 
which is the benchmark for point multiplication at the 128-bit security level, one 
gains an idea of the potential performance of arithmetic mod $s(2 54 • (2 3 — 1)) 
for example. In [6], residues are also represented by five 64-bit words. Residue 
multiplication requires 25 mul instructions, as well as some imul, add and adc in- 
structions. In comparison, to multiply x and y in our representation, the CVMA 



formulae are as follows: 














zo = 


(x 4 - 


-xi)(yi - 


- y 4 ) + {x 3 - 


-x 2 ){y2 - 


-2/3), 


Zl = 


(X2 - 


-X 4 )(V4 - 


- 2/2) + {xi - 


- x ){yo - 


-yi), 


Z2 = 


(xo - 


-2=2)0/2 - 


-yo)- 


- {x 4 - 


- £3X2/3 - 


- 2/4), 


Z3 = 


(%3 - 


- x )(y - 


-2/3)- 


- {X 2 - 


- xi)(yi - 


- 2/2), 


Z4 = 


(xi - 


-x 3 )(y 3 - 


-yi)- 


- {xo - 


- ^4X2/4 - 


- 2/0), 



requiring only 10 mul, 25 add and 5 adc instructions. Since the respective reduc- 
tion algorithms are quite similar with both requiring two rounds of shifts, masks 
and additions, one expects the GRP modular multiplication to be considerably 
faster, when optimsed. It is also possible that an optimised implementation of 
multiplication mod the m + 1 = 7 GRPs listed in Table 4 may be faster than [6] , 
since it requires 21 mul instructions, rather than 25. However, since this paper is 
predominantly expositional, we leave such optimisations as open research. 

12. Conclusion 

We have proposed efficient algorithms for performing arithmetic modulo a large 
family of primes, namely the generalised repunit primes. The algorithms are simple 
to implement, are fast, are easily parallelisable, can be made side-channel secure, 
and all across a wide range of field sizes. The central contribution of this work is 
the development of the necessary theory, covering field and residue representation, 
as well as novel algorithms for performing efficient multiplication and reduction in 
these fields. We have also presented proof-of-concept implementation results which 
provide an empirical comparison with other results in the literature, ensuring a fair 
comparison by reusing the same benchmarking procedure. Against Montgomery 
arithmetic we show an approximate three-fold increase in performance, and expect 
optimised implementations of point multiplications using our proposed family to be 
competitive with the state-of-the-art in the literature. We thus present a compelling 
argument in favour of a new approach to the secure and efficient implementation 
of ECC. 
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